home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Languguage OS 2
/
Languguage OS II Version 10-94 (Knowledge Media)(1994).ISO
/
language
/
parallax
/
docu_sho
< prev
next >
Wrap
Text File
|
1994-10-29
|
29KB
|
558 lines
Massively Parallel Programming with Parallaxis
----------------------------------------------
Thomas Braunl
Universitaet Stuttgart, IPVR
Breitwiesenstr. 20-22, D-7000 Stuttgart 80, FRG
e-mail: braunl@informatik.uni-stuttgart.de
### This paper only partially covers version 1 of the Parallaxis system,
### no version 2 extensions are being discussed!
Abstract
In the Parallaxis programming language, the model of a parallel architecture
is included as an integral part of the problem solution. That is, the
combination of algorithm and machine model accounts for an entire
specification. Since the choice of the computer architecture largely determines
the structure of the algorithm applied, a structured parallel programming
language should be given the expressive power of stating, or even better,
selecting the structure of the parallel architecture, for which an algorithm
is bound. Our language model Parallaxis operates on the class of SIMD array
processors. In addition to the concurrent algorithm, Parallaxis allows the
specification of arbitrary network topologies by means of a functional
description with configuration and connection specifications. Simple concepts
for concurrent execution of statements and message passing are based on these
definitions. Variable declaration takes the parallel machine model into account
and therefore splits into variables for the control unit (scalars) and
variables for each of the parallel processing units (vectors).
1. Introduction
The Parallaxis language model was designed to allow structured parallel programming in a high
level language, similar to sequential Pascal or Modula-2. The compiler should be ma-
chine-independent, so it can be used for a wide range of parallel architectures. In addition, it
should be useful in exploiting the parallel resources of any particular architecture satisfying the
basic machine model, in order to achieve optimal performance. Both goals (which might look
contradictory, at first) are achieved by a translation into a low level intermediate parallel lan-
guage which is processed by a machine-dependent interpreter or compiler. This flexible model
is restricted to SIMD structures; it combines the internal representation of the hardware struc-
ture with the topology- or structure-specific algorithm to form a complete problem solution.
Thus, it is possible to specify number and arrangement of processing elements, as well as their
communication network in Parallaxis.
Here, we get a new meaning of the term structured. The whole problem solution is structured,
including the algorithm, as well as the architecture for which the algorithm is bound. Given this
semantics, it is necessary to include a machine description, if the language should not be re-
stricted to a single architecture that is implicitly assumed. An algorithm for the same problem
will look totally different if designed for a hypercube instead of a ring-topology. Besides, for a
certain problem there may be structures well suited and others that do not match very well. If
there is a choice (as for reconfigurable parallel systems), the architecture specification can be
used for creating the best matching architecture.
In this paper, I will address the language concepts supporting the Parallaxis user in pro-
gramming a SIMD computer on a high level of abstraction. I will not be concerned at this
point, however, what strategy or algorithm to apply in order to perform an efficient mapping
between a topology specified for a certain application and the topology provided by the physi-
cal hardware structure. This problem is handled automatically on a lower operating system
level, transparent to the user. However, it will greatly affect the efficiency of an application
with heavy communication. H. J. Siegel developed efficient mapping algorithms for simulating
one topology on a different topology.
Corresponding to the machine model of a processor-network, controlled by a single control
unit, variables may be declared either for the control unit (using the declaration keyword
scalar), or for each of the PEs (using vector). Each variable is strictly typed, like in Modula-2,
so there may be no vector variables in an expression that is to be assigned to a scalar variable.
Vectors may be used inside explicitly marked parallel blocks or operations only while scalars
may also appear in parallel vector expressions (requiring a duplication or broadcast of that par-
ticular value).
2. Specification of Network Topologies
Parallaxis makes the following assumptions about its abstract parallel machine, whether it is
physically present or simulated:
* SIMD structure with
- central control unit
- variable number of processing elements (PEs)
- flexible (reroutable) communication network that is,
it can be used as if forming any topological structure
* all operations occur synchronously
* all PEs are identical in processor and memory structure
* each PE owns the same number of bi-directional ports
for sending and receiving messages
* one process per PE
Within the realm of this machine model, the programmer may now choose a certain topological
structure for the communication network to achieve the best possible match between ar-
chitecture and algorithm of a given problem. For each application, the number of PEs and the
network topology is static, that is, it has to be specified prior to compilation and will remain
unchanged during program execution.
The specification of the network structure's logical form takes two steps: First, Parallaxis is
told the number of processing elements one wants to use and how they will be arranged in di-
mensions. This almost exactly takes the form of Modula-2's array declaration, except that we
are dealing with processors here instead of data elements. Although we create some neighbor
relation between the PEs by this declaration, this does not specify any connections between
processors which is reserved for the second step. There, we may specify a transfer-function
(from a general PE position to its relative neighbor) for every processor-port in this topology.
Or, stated differently, the number of transfer-functions provided equals the number of ports per
PE. Each transfer-function has a name (the exit-port's name) and also states the name of the
corresponding entry port on its neighbor-PE after a period.
Parallaxis allows the specification of arbitrary network topologies. The examples shown in the
following sections should be guidelines for modeling other topologies.
2.1 Ring Structure
The simple topological structure of a ring may be specified as follows in Parallaxis:
CONFIGURATION ring [12];
CONNECTION c_wise : ring[i] -> ring[(i+1) mod 12].cc_wise;
cc_wise: ring[i] -> ring[(i-1) mod 12].c_wise;
11 --- 0 --- 1
| |
10 2
| |
9 3 <-- PE -->
| | cc_wise c_wise
8 4
| |
7 --- 6 --- 5
Figure 1: ring topology
With the configuration specification the user tells the system to reserve twelve processors, la-
beled zero to eleven, that are arranged in a single dimension. Each PE owns two ports, called
c_wise (for clockwise mapping) and cc_wise (for counter-clockwise mapping), corresponding
to these two transfer-functions. While the function c_wise maps any processing element to the
next higher PE, the function cc_wise maps any element to the next lower one. Using the mod-
ulo-operator results in a closed topology that is, every element has a neighbor at every port.
2.2 Two-Dimensional Grids
Now, let us take a look at some simple two-dimensional network structures: the grid, an open
topology that might be useful in application areas ranging from image processing to air flow
analysis for wing profile design, and its closed counterpart the torus (see figure 2).
^ ^ ^ ^ ^
| | | | |
grid X -- X -- X -- X -- X torus <- X -- X -- X -- X -- X ->
| | | | | (wrapped | | | | |
X -- X -- X -- X -- X around) <- X -- X -- X -- X -- X ->
| | | | | | | | | |
X -- X -- X -- X -- X <- X -- X -- X -- X -- X ->
| | | | | | | | | |
X -- X -- X -- X -- X <- X -- X -- X -- X -- X ->
| | | | |
v v v v v
Figure 2: two-dimensional grid and torus
Their specification in Parallaxis is straightforward, as shown in figure 3. It takes just four
transfer-functions to specify the whole grid with each PE owning four data-ports (here called:
north, south, east, and west). The torus specification is almost identical; the additional modulo-
operator connects opposite boundary-PEs. When considering the grid specification, there are
PEs whose neighbor function evaluates to an invalid PE-number: e.g. the PE at position [2,4]
would have a right neighbor [2,5], but this is out of bounds as of the grid size fixed in the
CONNECTION specification (rows labeled 0 to 3, and columns labeled 0 to 4). By definition,
these PEs (called "boundary-PEs") simply do not have a connection in that particular direction.
Any data exchange to or from the outside will have no effect (see section 4 for more detail).
CONFIGURATION grid [4] [5];
CONNECTION north: grid[i,j] -> grid[i+1, j].south;
south: grid[i,j] -> grid[i-1, j].north;
east : grid[i,j] -> grid[i, j+1].west;
west : grid[i,j] -> grid[i, j-1].east;
CONFIGURATION torus [4] [5];
CONNECTION north: torus[i,j] -> torus[(i+1) mod 4, j].south;
south: torus[i,j] -> torus[(i-1) mod 4, j].north;
east : torus[i,j] -> torus[i, (j+1) mod 5].west;
west : torus[i,j] -> torus[i, (j-1) mod 5].east;
Figure 3: grid and torus specification
3. Variable Declaration
As seen before, we have to deal with a basically two-fold system structure:
* a controlling host and
* a network of processing elements.
Variables that are used for controlling the execution sequence, such as a counting variable in a
for-loop, should exist only once at the host computer while a variable used for vector compu-
tations might be required once for each PE. To distinguish these two types of variable declara-
tions, Parallaxis offers the phrases scalar for declaring a variable at the host, and vector for
declaring a variable to appear in every PE's memory structure, thus overall creating a vector (or
some higher order data structure, like a matrix, etc., depending on the specified network topo-
logy). An example follows:
SCALAR i,j: integer;
VECTOR a,b: real;
c,d: integer;
The distinction between scalar and vector variables is also necessary for formal procedure pa-
rameters and local variable declarations.
4. Parallel Programming Concepts
Parallaxis provides language primitives to account for the parallel machine model which build
on the topological structure specified by means of configuration and connection. There are
three basic concepts for parallel processing in Parallaxis:
1. parallel execution ("parallel block")
2. parallel data exchange ("propagate operation")
3. vector reduction ("reduce function")
4.1 Parallel Execution
In analogy to the begin-end block used to group statements to be executed sequentially, there is
the parallel-endparallel block for synchronous parallel statements. The semantics hereby is that
every processing element executes the same statement with its own local values of the variables
involved, thus obtaining individual results. So the phrase parallel mustn't lead to an erroneous
interpretation: all statements contained in this block are executed sequentially under central
control, but they are performed data-parallel. Together with a unique control flow this just re-
flects the single instructionJP multiple data (SIMD) machine model.
Because of the SIMD restriction, each PE may execute the current instruction or remain idle; a
concurrent execution of different instructions is not possible. There are two ways of selecting
PEs for parallel execution in Parallaxis which may also be combined:
A) by explicitly stating their network position at the entrance of a parallel block
for each dimension specified
e.g., for a one-dimensional structure:
PARALLEL [22..44]
<parallel statements>
ENDPARALLEL
B) by using if-, case-selections or while-, repeat-loops with vector conditions
e.g.:
PARALLEL
IF <vector expression> THEN <parallel statements> END
ENDPARALLEL
In case A, only the PEs within the selected range execute the statements inside the parallel
block, all others remain idle during that time. PE selections may be constant or variable posi-
tional expressions, such as subrange, enumeration, and set. One selection is required for every
dimension.
In case B, a conditional expression determines which PEs will execute the statements of the
then-branch and which will remain idle. While the branching-condition evaluates to true for one
processor, it might be false for others. The condition is evaluated for each PE individually in
parallel and only those PEs for which the condition evaluates to true will execute the then-
branch; all other PEs remain idle. If there exists an else-branch, it will be executed subse-
quently with the inverse PE group. The two branches of an if-selection cannot be executed in
parallel because of the previously mentioned "single control flow" SIMD restriction. Therefore,
they have to be serialized. Processors that execute the then-part may continue while processors
executing the else-part are blocked with their identifications pushed onto a global stack at the
controlling host. They remain inactive while the host supervises execution of the then-part.
Afterwards, when executing the else-part, the processor sets change places, that is, the "then-
processors" now become inactive for some time while the "else-processors" are active. The use
of a dynamic stack also accounts for nested if-statements. Each stack-entry corresponds to a
nesting-level of if-statements. Since serialization degrades the performance of a parallel system,
the user should keep this fact in mind when designing a parallel algorithm.
The semantics of a parallel loop is analogous: only those PEs satisfying the loop-condition
execute the loop-statements. A loop can only be terminated when none of the active PEs satis-
fy the loop-condition. As long as a single PE remains, the loop is being continued while all
PEs excluded by the loop-condition are idle. This means, the controlling host has to get a feed-
back from the network whether there are PEs remaining or not. The implementation of this
feedback depends on the hardware facilities of the target system. In any case, the OR-reduction
of the condition-vector (see also section 4.3) returns this information to the host.
4.2 Parallel Data Exchange
The propagate-operation accounts for data exchange between the parallel processors. Propagate
behaves like a compound statement of send followed immediately by receive. All selected (or
active) PEs participate in this parallel data exchange. In its simple form, the propagate-opera-
tion reads:
propagate.<direction> ( <vector variable> )
e.g., for the ring structure of section 2.1:
propagate.c_wise (x)
The semantics of this statement is that the value of a vector (here: "x") is propagated through
the network by one step in the stated direction (here: "c_wise" for clockwise). The direction
names directly refer to the CONNECTION specification of the network where they were defined
as exit-ports. For simple network structures like ring, grid, hypercube, and so on, the corre-
sponding entry-port is unambiguous. Therefore, it can be determined automatically by Par-
allaxis (in our ring example: "cc_wise"). For complex topologies (as the tree structure, see
section 6) both send-port and receive-port have to be supplied, in order to perform the propa-
gate-operation.
Let us now assume that the PEs labeled 3 to 8 of the twelve ring PEs have been selected
(activated) to perform the above propagate-operation. Each PE sends a message (here: the local
value of variable "x") to its clockwise neighbor and then reads a message from its counter-
clockwise neighbor, since the port cc_wise was defined to be the entry for c_wise in the con-
nection specification. Disregarding those basic sends and receives, we recognize that informa-
tion has been propagated one shift in clockwise direction all around the selected sector, thus
reflecting the operation's name. Though the ring represents a closed topology, the sector se-
lected here is equivalent to an open topology. Interesting is the behavior at the boundary, which
in our example are PEs no. 3 and 8. PE no. 3 does not have an active neighbor
for receiving (direction "cc_wise") while PE no. 8 does not have a neighbor for sending
(direction "c_wise"). As of the propagate-operation, this is equivalent to not having a neighbor
at all for that direction in an open topology. PE no. 3 first sends its message along to PE no. 4.
Since it cannot perform a subsequent receive, its local value of "x" remains unchanged. PE
no. 8 's send goes without any effect (its neighbor is inactive) while the receive operation
can be performed normally.
Other syntactical form of the propagate-operation allow:
* the separation between send-expression and receive-variable
(in case one does not want to overwrite the variable as in the previous
form)
e.g.: propagate.c_wise (x+1,y)
* the propagation of several steps along a fixed direction
e.g.: propagate.c_wise ^5 (x)
4.3 Vector Reduction
Sometimes not the whole vector is of interest, but only a scalar reduction of it like its sum.
Without a specialized operation, this scalar information is awkward to get. Parallaxis provides
load and download operations to move a scalar array at the host to and from a vector distributed
over the network. So, in order to get the sum of a vector, one has to download the complete
vector to the host and then add it up iteratively. The reduce-function enables a vector reduction
in a single statement and without involving additional variables. The desired reduction opera-
tion has to be specified after a period. Predefined are sum, product, and, or, min, max, first,
and last; user defined functions with appropriate parameters may be used as well.
The reduce-operation is completely independent of the specified topology; it is a primitive, one-
dimensional vector operation. Specialized components of a parallel system may execute this
command in time O(log2 n) in a tree-like operation mode.
Assume "s" being a real scalar and "x" being a real vector:
s := REDUCE.sum (x)
This statement assigns the component-wise sum of vector "x" to scalar "s".
5. Sample Programs
Now, we are going to look at a couple of Parallaxis programs. The first shows
a parallel sorting algorithm, called "odd-even transposition sorting",
the second is a parallel version of the "Sieve of Eratosthenes" for
generating prime numbers.
"Odd-Even Transposition Sorting" (OETS) is a parallel sorting algorithm that is able to sort n
numbers on n PEs in time O(n). The PEs are connected in a bi-directional, open linear list; I/O
instructions have been omitted for clarity. In odd iteration steps, the PE-pairs 1-2, 3-4, and so
on are compared in parallel while in even iteration steps the pairs 2-3, 4-5, and so on are han-
dled.
SYSTEM sorting;
CONST n = 1000;
CONFIGURATION list [1..n];
CONNECTION left : list[i] -> list[i-1].right;
right: list[i] -> list[i+1].left;
SCALAR k : integer;
VECTOR val,r,l : integer;
swap : boolean;
BEGIN
... (* read input data *)
FOR k:=1 TO n DO
PARALLEL
PROPAGATE.right(val,l);
PROPAGATE.left (val,r);
(* l/r now hold the left/right neighbors' values *)
swap := false;
IF odd(k) THEN (* compare 1-2, 3-4, ... *)
IF odd(dim1) AND (r < val) THEN
val := r;
swap := true
END
ELSE (* even (k) compare 2-3, 4-5, ... *)
IF even(dim1) AND (r < val) THEN
val := r;
swap := true
END;
END;
PROPAGATE.right(swap);
IF swap AND (id_no>1) THEN val := l END;
ENDPARALLEL
END;
... (* write output data *)
END sorting.
Each PE holds one component of the vector "val" that is to be sorted, as well
as local copies of each one's left and right neighbor. The marker variable
"swap" is used for the bookkeeping of swap operations to be finished at the
right neighbor PEs. A different approach without marker propagation is
possible, but complicates the program.
The program for generating prime numbers is very much straight-forward and
does not need a lot of explanation. In each iteration through the "while"-
loop, all multiples of the current number are eliminated in parallel. Execution
of the loop continues until no "candidate" (on any PE) is left.
SYSTEM sieve;
CONFIGURATION list [1000];
CONNECTION (* none *);
SCALAR prime: integer;
VECTOR candidate: boolean;
BEGIN
PARALLEL
candidate := id_no >=2;
WHILE candidate DO
prime:= REDUCE.First(id_no);
WriteInt(prime,10); WriteLn;
IF id_no MOD prime = 0 THEN candidate:=FALSE END
END
ENDPARALLEL
END sieve.
6. Parallaxis Syntax
Parallaxis Language Definition
(c) Thomas Braunl, Universitaet Stuttgart, 1989
System = SYSTEM sys_ident ";"
{ ConstantDecl | TypeDecl }
HardwareDecl SoftwareDecl
sys_ident "." .
ConstantDecl = CONST { ident "=" ConstExpr ";" } .
TypeDecl = TYPE { ident "=" type ";" } .
HardwareDecl = CONFIGURATION conf_ident IntRange { "," IntRange } ";"
CONNECTION [ TransferFunc { ";" TransferFunc } ] ";" .
IntRange = "[" range "]" .
range = int_ConstExpr [ ".." int_ConstExpr ] .
TransferFunc = out_direction ":" conf_ident
"[" source { "," source } "]" ( "->" | "<->" )
destination { "," destination } .
direction = ident [ "(" (integer | const_ident) ")" ] .
source = ident | integer.
destination = [ discriminant ]
conf_ident "[" ExprList "]" "." in_direction.
discriminant = "{" bool_expr "}".
SoftwareDecl = VariableDecl { ProcedureDecl ";" }
block .
VariableDecl = [ ControlVarDecl ] [ LocalVarDecl ] .
ControlVarDecl = SCALAR { ident { "," ident } ":" type ";" } .
LocalVarDecl = VECTOR { ident { "," ident } ":" type ";" } .
ProcedureDecl = PROCEDURE proc_ident [FormalParams] ";"
{ ConstantDecl | TypeDecl }
SoftwareDecl proc_ident .
FormalParams = "(" [ parameters { ";" parameters } ] ")"
[ ":" ( SCALAR | VECTOR ) function_type ] .
parameters = SCALAR [ VAR ] ident { "," ident } ":" type |
VECTOR [ VAR ] ident { "," ident } ":" type .
block = BEGIN
StatementSeq
END .
StatementSeq = statement { ";" statement } .
statement = [ assignment | ProcedureCall |
IfSelection | CaseSelection | WhileLoop |
RepeatLoop | LoopStatement | ForLoop |
WithStatement | EXIT | RETURN [ expr ] |
ParallelExec | Propagate | Load | Store ] .
ParallelExec = PARALLEL selection
StatementSeq
ENDPARALLEL .
selection = [ "[" entry "]" { "," "[" entry "]" } ] .
entry = range { "," range } | expr [ ".." expr ] | "*" .
Propagate = PROPAGATE "." out_dirvar [ "^" ( integer | ident) ]
[ "." in_dirvar ]
"(" vector_designator [ "," vector_designator ] ")" .
dirvar = ident [ "(" expr ")" ] .
Load = LOAD selection "(" vector_designator "," scalar_designator
[ "," length_designator ] ")" .
Store = STORE selection "(" vector_designator "," scalar_designator
[ "," length_designator ] ")" .
Reduce = REDUCE "." operator_ident selection "(" expr ")" .
assignment = designator ":=" expr .
designator = ident { "[" ExprList "]" | "." ident } .
ProcedureCall = proc_ident [ "(" ExprList ")" ] .
FunctionCall = proc_ident "(" [ ExprList ] ")" .
IfSelection = IF bool_expr THEN StatementSeq
{ ELSIF bool_expr THEN StatementSeq }
[ ELSE StatementSeq ] END .
CaseSelection = CASE expr OF case { "|" case }
[ ELSE StatementSeq ] END.
case = CaseLabels { "," CaseLabels } ":" StatementSeq .
CaseLabels = ConstExpr [ ".." ConstExpr ] .
WhileLoop = WHILE bool_expr DO StatementSeq END .
RepeatLoop = REPEAT StatementSeq UNTIL bool_expr .
LoopStatement = LOOP StatementSeq END .
ForLoop = FOR ident ":=" expr TO expr [ BY ConstExpr ]
DO StatementSeq END .
WithStatement = WITH designator DO StatementSeq END .
type = SimpleType | ArrayType| RecordType | SetType .
SimpleType = type_ident | enumeration | subrange .
enumeration = "(" const_ident { "," const_ident } ")" .
subrange = "[" ConstExpr ".." ConstExpr "]" .
ArrayType = ARRAY SimpleType { "," SimpleType } OF type .
RecordType = RECORD FieldListSeq END .
FieldListSeq = FieldList { ";" FieldList } .
FieldList = [ ident { "," ident } ":" type ] .
(* variant records not yet supported *)
SetType = SET OF SimpleType .
ExprList = expr { "," expr } .
expr = SimpleExpr { relation SimpleExpr } .
relation = "=" | "<>" | "#" | "<" | ">" | "<=" | ">=" | IN .
SimpleExpr = [ "+" | "-" ] term { AddOperator term } .
AddOperator = "+" | "-" | OR .
term = power { MulOperator power} .
MulOperator = "*" | "/" | DIV | MOD | AND | "&" .
power = factor { "^" factor } .
factor = FunctionCall | Reduce | string | set |
number | designator | structure |
"(" expr ")" | NOT factor .
string = "'" { character } "'" | '"' { character } '"' .
set = [type_ident] "{" [ element { "," element } ] "}" .
element = ConstExpr [ ".." ConstExpr ] .
structure = record_ident "(" ExprList ")" .
CExprList = [ ConstExpr { "," ConstExpr } ] .
ConstExpr = SimpleConstExpr { relation SimpleConstExpr } .
SimpleConstExpr = [ "+" | "-" ] ConstTerm { AddOperator ConstTerm } .
ConstTerm = ConstPower { MulOperator ConstPower } .
ConstPower = ConstFactor { "^" ConstFactor } .
ConstFactor = const_ident | number | string | set |
recarr_ident "(" CExprList ")" |
stdfct_ident "(" CExprList ")" |
"(" ConstExpr ")" | NOT ConstFactor .
number = integer | real .
integer = digit {digit} .
real = digit {digit} "." {digit} ["E" ["+" | "-"] digit {digit}] .
ident = letter { letter | digit } .
character = letter | digit | "$" | .. (* any character, e.g. ASCII *) .
digit = "0" | . . | "9" .
letter = "A" | . . | "Z" | "a" | . . | "z" | "_" .